Abstract: This paper presents a multiplier less approach
to implementing a high speed and area efficient decimator
for the down converter of Software Defined Radios. The
technique replaces multiply-and-accumulate (MAC)
operations with look up table (LUT) accesses. The proposed
decimator is implemented using a partitioned distributed
arithmetic look up table (DALUT) algorithm that takes
optimal advantage of the embedded LUTs of the target
FPGA device, which improves system performance in terms
of both speed and area. The decimator uses a half band
polyphase decomposition FIR structure. It has been
designed with Matlab 7.6, simulated with Modelsim 6.3XE,
synthesized with Xilinx Synthesis Tool (XST) 10.1 and
implemented on a Spartan-3E based 3s500efg320-4 FPGA
device. The proposed DALUT approach shows an
improvement of 24% in speed while saving almost 50% of
the resources of the target device as compared to the MAC
based approach.

Index Terms— ASIC, DALUT, FPGA, MAC, SDR 

I. Introduction 

The widespread use of digital representation of
signals for transmission and storage has created
challenges in the area of digital signal processing [1].
Digital FIR filters and up/down sampling techniques are
found everywhere in modern electronic products. For every
electronic product, lower circuit complexity is always an
important design target since it reduces the cost [2].
There are many applications where the sampling rate must
be changed.
Interpolators and decimators are used to increase or
decrease the sampling rate, and up samplers and down
samplers change the sampling rate of a digital signal in
multirate DSP systems. This rate conversion produces
undesired signal components associated with aliasing and
imaging errors, so a filter must be placed to attenuate
these errors [3]-[5]. Today's consumer electronics, such
as cellular phones and other multimedia and wireless
devices, often require digital signal processing (DSP)
algorithms for several crucial operations [6], and these
must be implemented so as to increase speed and reduce
area and power consumption. Due to a growing demand for
such complex DSP applications, high performance, low-cost
SoC implementations of DSP algorithms are receiving
increased attention among researchers and design
engineers. Although ASICs and DSP chips have been the
traditional solution for high performance applications,
technology and market demands now call for alternatives.
On one hand, the high development costs and time-to-market
factors associated with ASICs can be prohibitive for
certain applications while, on the other hand,
programmable DSP processors can be unable to meet the
desired performance due to their sequential-execution
architecture [7]. In this context, embedded FPGAs offer
a very attractive solution that balances flexibility,
time-to-market, cost and performance. Therefore, in
this paper, a decimator is designed and implemented on an
FPGA device. The output of an FIR filter may be expressed
as:

$$ y = \sum_{k=1}^{K} C_k x_k \qquad (1) $$

where $C_1, C_2, \ldots, C_K$ are fixed coefficients and
$x_1, x_2, \ldots, x_K$ are the input data words. A typical
digital implementation requires K multiply-and-accumulate
(MAC) operations, which are expensive in hardware in terms
of logic complexity, area usage, and throughput.
Alternatively, the MAC operations may be replaced by a
series of look-up-table (LUT) accesses and summations.
Such an implementation of the filter is known as
distributed arithmetic (DA).

Digital signal processing with variable sampling rates
can improve the flexibility of a software defined radio.
It reduces the need for expensive analog anti-aliasing
filters and enables processing of different types of
signals with different sampling rates. It allows the
high-speed processing to be partitioned into multiple
parallel lower-speed processing tasks, which can lead to
a significant saving in computational power and cost.
Wideband receivers take advantage of multirate signal
processing for efficient channelization, and it offers
flexibility for symbol synchronization.

II. Decimators 

Typically, lowpass filters are used to reduce the
bandwidth of a signal prior to reducing the sampling
rate, in order to minimize the aliasing caused by the
rate reduction. The down sampler is the basic sampling
rate alteration device used to decrease the sampling rate
by an integer factor [8]. A down-sampler with a
down-sampling factor M, where M is a positive integer,
develops an output sequence y[n] with a sampling rate
that is (1/M)-th of that of the input sequence x[n]. The
down sampler is shown in Figure 1.



[Block diagram: input sequence x[n] -> down-sampler by factor M -> output sequence y[n]]

Figure 1. Down Sampler

The down-sampling operation is implemented by
keeping every Mth sample of x[n] and removing the M-1
in-between samples to generate y[n]. The input-output
relation of the down sampler can be expressed as:

$$ y[n] = x[nM] \qquad (2) $$

Applying the z-transform to the input-output relation
of a factor-of-M down-sampler, we get

$$ Y(z) = \sum_{n=-\infty}^{\infty} x[Mn]\, z^{-n} \qquad (3) $$



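The expression on the right-hand side of Eq. (3)
cannot be directly expressed in terms of X(z). To get
around this problem, a new sequence $x_{int}[n]$ can be
defined as:

$$ x_{int}[n] = \begin{cases} x[n], & n = 0, \pm M, \pm 2M, \ldots \\ 0, & \text{otherwise} \end{cases} \qquad (4) $$

so that

$$ Y(z) = \sum_{n=-\infty}^{\infty} x_{int}[Mn]\, z^{-n} = \sum_{k=-\infty}^{\infty} x_{int}[k]\, z^{-k/M} = X_{int}(z^{1/M}) \qquad (5) $$

Now, $x_{int}[n]$ can be formally related to x[n] as follows:

$$ x_{int}[n] = c[n] \cdot x[n] \qquad (6) $$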



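where

$$ c[n] = \begin{cases} 1, & n = 0, \pm M, \pm 2M, \ldots \\ 0, & \text{otherwise} \end{cases} \qquad (7) $$

A convenient representation of c[n] is given by

$$ c[n] = \frac{1}{M} \sum_{k=0}^{M-1} W_M^{kn} \qquad (8) $$

where

$$ W_M = e^{-j 2\pi / M} \qquad (9) $$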
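The representation in Eq. (8) can be checked numerically. The following is a minimal sketch, assuming M = 4 and a hypothetical helper name `comb_sequence`; it evaluates the sum of powers of $W_M$ and compares it with the 0/1 pattern of Eq. (7).

```python
import cmath

def comb_sequence(n: int, M: int) -> complex:
    """Evaluate (1/M) * sum_{k=0}^{M-1} W_M^(k*n), with W_M = exp(-j*2*pi/M)."""
    W = cmath.exp(-2j * cmath.pi / M)
    return sum(W ** (k * n) for k in range(M)) / M

M = 4  # assumed down-sampling factor, for illustration only
for n in range(-8, 9):
    expected = 1.0 if n % M == 0 else 0.0   # Eq. (7)
    value = comb_sequence(n, M)             # Eq. (8)
    print(n, expected, round(abs(value), 12))
```

The printed magnitudes equal 1 at multiples of M and vanish (to rounding error) elsewhere, which is why multiplying x[n] by c[n] keeps only every Mth sample.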



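Taking the z-transform of Eq. (6) and making use of
Eq. (8), we get

$$ X_{int}(z) = \sum_{n=-\infty}^{\infty} c[n]\, x[n]\, z^{-n} = \frac{1}{M} \sum_{n=-\infty}^{\infty} \left( \sum_{k=0}^{M-1} W_M^{kn} \right) x[n]\, z^{-n} = \frac{1}{M} \sum_{k=0}^{M-1} \left( \sum_{n=-\infty}^{\infty} x[n]\, W_M^{kn} z^{-n} \right) = \frac{1}{M} \sum_{k=0}^{M-1} X(z W_M^{-k}) \qquad (10) $$

Substituting into Eq. (5), we obtain

$$ Y(z) = \frac{1}{M} \sum_{k=0}^{M-1} X(z^{1/M} W_M^{-k}) \qquad (11) $$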

The spectrum of a factor-of-2 down-sampler with an
input x[n] is shown in Figure 2. The DTFTs of the output
and the input sequences of this down-sampler are then
related as

$$ Y(e^{j\omega}) = \frac{1}{2} \left\{ X(e^{j\omega/2}) + X(-e^{j\omega/2}) \right\} \qquad (12) $$
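The relation in Eq. (12) can be verified on a short sequence. The sketch below is illustrative only: the sample values in `x` and the helper names `dtft` and `rhs` are assumptions, not taken from the paper. It uses the fact that $X(-e^{j\omega/2})$ is the DTFT of x[n] evaluated at $\omega/2 + \pi$.

```python
import cmath

def dtft(x, w):
    """DTFT of a finite causal sequence x at angular frequency w."""
    return sum(v * cmath.exp(-1j * w * n) for n, v in enumerate(x))

# Arbitrary test sequence (assumed values, for illustration only)
x = [1.0, 0.5, -0.25, 2.0, 0.75, -1.0, 0.3, 0.1]
y = x[::2]  # factor-of-2 down-sampling, y[n] = x[2n]

def rhs(w):
    """Right-hand side of Eq. (12): two shifted and stretched copies of X."""
    return 0.5 * (dtft(x, w / 2) + dtft(x, w / 2 + cmath.pi))

for w in (0.0, 0.4, 1.3, 2.9):
    print(w, abs(dtft(y, w) - rhs(w)))
```

The differences printed are at rounding-error level for every frequency, since the odd-indexed samples cancel between the two terms and the even-indexed samples add.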

The two terms overlap, so the original "shape" of
$X(e^{j\omega/2})$ is lost when x[n] is down-sampled.
This overlap causes the aliasing that takes place due to
under-sampling. There is no overlap, i.e., no aliasing,
only if

$$ X(e^{j\omega}) = 0 \quad \text{for } |\omega| \ge \pi/2 \qquad (13) $$

In general, aliasing is absent if and only if

$$ X(e^{j\omega}) = 0 \quad \text{for } |\omega| \ge \pi/M \qquad (14) $$

To overcome the effect of aliasing, decimation filters
are used. The specifications for the lowpass decimation
filter are given by

$$ |H(e^{j\omega})| = \begin{cases} 1, & |\omega| \le \omega_c / M \\ 0, & \pi/M \le |\omega| \le \pi \end{cases} \qquad (15) $$



III. DALUT Algorithm

The DALUT algorithm is an efficient method for
computing inner products when one of the input vectors
is fixed. It uses look-up tables and accumulators instead
of multipliers and has been widely used in many DSP
applications such as DFT, DCT, convolution, and digital
filters. Direct DA inner-product generation starts from
Eq. (1), where $x_k$ is a 2's-complement binary number
scaled such that $|x_k| < 1$. We may express each
$x_k$ as

$$ x_k = -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \qquad (16) $$

where the $b_{kn}$ are the bits, 0 or 1, $b_{k0}$ is the
sign bit and N is the word length in bits. Combining
Eq. (1) and (16) in order to express y in terms of the
bits of $x_k$, we obtain

$$ y = \sum_{k=1}^{K} C_k \left[ -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \right] \qquad (17) $$

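Eq. (17) is the conventional form of expressing the
inner product. Interchanging the order of the summations
gives

$$ y = \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} C_k b_{kn} \right] 2^{-n} + \sum_{k=1}^{K} C_k (-b_{k0}) \qquad (18) $$

Eq. (18) shows a DA computation where the bracketed
term is given by

$$ \sum_{k=1}^{K} C_k b_{kn} \qquad (19) $$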
Each $b_{kn}$ can take the values 0 and 1, so Eq. (19)
can have $2^K$ possible values. Rather than computing these
values on line, we may pre-compute the values and
store them in a ROM. The input data bits directly address
the memory, and the addressed value is added to an
accumulator. After N such cycles, the accumulator contains
the result, y. As an example, let us consider K = 4,
$C_1$ = 0.45, $C_2$ = -0.65, $C_3$ = 0.15, and
$C_4$ = 0.55. The memory must contain all possible
combinations ($2^4$ = 16 values) and their negatives in
order to accommodate the term

$$ \sum_{k=1}^{K} C_k (-b_{k0}) \qquad (20) $$

which occurs at the sign-bit time.
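The K = 4 example can be sketched in software as follows. This is an illustration under stated assumptions, not the authors' hardware: the word length N = 8, the sample inputs and the name `da_inner_product` are hypothetical, exact fractions stand in for fixed-point hardware words, and the sign-bit term is handled by subtracting the looked-up value rather than by storing negated entries.

```python
from fractions import Fraction

# Coefficients from the K = 4 example in the text
C = [Fraction("0.45"), Fraction("-0.65"), Fraction("0.15"), Fraction("0.55")]
K = len(C)
N = 8  # assumed word length in bits (not specified for this example)

# LUT[addr] = sum of C_k over the set bits of addr; bit k of addr holds b_kn
LUT = [sum((C[k] for k in range(K) if (addr >> k) & 1), Fraction(0))
       for addr in range(2 ** K)]

def da_inner_product(x_int):
    """Compute sum_k C_k * x_k by table look-ups, following Eq. (18).

    x_int holds K two's-complement integers in [-2^(N-1), 2^(N-1) - 1];
    the represented values are x_k = x_int[k] / 2^(N-1), so |x_k| <= 1.
    """
    words = [v & ((1 << N) - 1) for v in x_int]   # N-bit two's complement
    acc = Fraction(0)
    for n in range(N):                            # n = 0 is the sign bit
        addr = 0
        for k, w in enumerate(words):
            bit = (w >> (N - 1 - n)) & 1          # b_kn
            addr |= bit << k
        term = LUT[addr] * Fraction(1, 2 ** n)
        acc += -term if n == 0 else term          # subtract at sign-bit time
    return acc

x_int = [53, -128, 127, -17]                      # assumed sample inputs
direct = sum(c * Fraction(v, 2 ** (N - 1)) for c, v in zip(C, x_int))
print(da_inner_product(x_int) == direct)
```

No multiplication by a data word appears in `da_inner_product`; the only scaling is by powers of two, which corresponds to shifts in hardware.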



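The structure that can be used to compute these
equations is shown in Figure 6. The term $x_k$ may be
written as

$$ x_k = \frac{1}{2} \left[ x_k - (-x_k) \right] \qquad (21) $$

and in 2's-complement notation the negative of $x_k$ may
be written as

$$ -x_k = -\bar{b}_{k0} + \sum_{n=1}^{N-1} \bar{b}_{kn} 2^{-n} + 2^{-(N-1)} \qquad (22) $$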

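where the overscore symbol indicates the complement
of a bit. By substituting Eq. (16) and (22) into
Eq. (21), we get

$$ x_k = \frac{1}{2} \left[ -(b_{k0} - \bar{b}_{k0}) + \sum_{n=1}^{N-1} (b_{kn} - \bar{b}_{kn}) 2^{-n} - 2^{-(N-1)} \right] \qquad (23) $$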

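In order to simplify the notation later, it is convenient
to define the new variables as

$$ a_{kn} = b_{kn} - \bar{b}_{kn} \quad \text{for } n \neq 0 \qquad (24) $$

and

$$ a_{k0} = -(b_{k0} - \bar{b}_{k0}) \qquad (25) $$

where the possible values of the $a_{kn}$, including
n = 0, are ±1. Then Eq. (23) may be written as

$$ x_k = \frac{1}{2} \left[ \sum_{n=0}^{N-1} a_{kn} 2^{-n} - 2^{-(N-1)} \right] \qquad (26) $$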



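By substituting the value of $x_k$ from Eq. (26) into
Eq. (1), we obtain

$$ y = \frac{1}{2} \sum_{k=1}^{K} C_k \left[ \sum_{n=0}^{N-1} a_{kn} 2^{-n} - 2^{-(N-1)} \right] \qquad (27) $$

Defining

$$ Q(b_n) = \sum_{k=1}^{K} \frac{C_k}{2} a_{kn} \quad \text{and} \quad Q(0) = -\sum_{k=1}^{K} \frac{C_k}{2} \qquad (28) $$

the output becomes
$y = \sum_{n=0}^{N-1} Q(b_n) 2^{-n} + 2^{-(N-1)} Q(0)$.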



It may be seen that $Q(b_n)$ has only $2^{(K-1)}$ possible
amplitude values, with a sign that is given by the
instantaneous combination of bits. The computation of
y is obtained by using a $2^{(K-1)}$-word memory, a
one-word initial condition register for Q(0), and a single
parallel adder/subtractor with the necessary control-logic
gates.
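The halved memory can be illustrated with the same K = 4 coefficients. The sketch below is one possible arrangement under stated assumptions (N = 8, hypothetical names `HALF`, `lookup` and `obc_inner_product`, exact fractions instead of hardware words): only the entries whose first bit is 1 are stored, and the remaining values are obtained by complementing the address and negating the result.

```python
from fractions import Fraction

C = [Fraction("0.45"), Fraction("-0.65"), Fraction("0.15"), Fraction("0.55")]
K, N = len(C), 8  # N = 8 is an assumed word length

def q_full(bits):
    """Q for one bit slice with a_k = 2*b_k - 1, i.e. sum_k (C_k / 2) * a_k."""
    return sum(c * (2 * b - 1) for c, b in zip(C, bits)) / 2

# Store only the 2^(K-1) entries whose first bit is 1
HALF = [q_full([1] + [(idx >> k) & 1 for k in range(K - 1)])
        for idx in range(2 ** (K - 1))]
Q0 = -sum(C) / 2  # initial condition register Q(0)

def lookup(bits):
    """Return q_full(bits) using the half-size table and a sign flip."""
    if bits[0] == 1:
        rest, sign = bits[1:], 1
    else:
        rest, sign = [1 - b for b in bits[1:]], -1  # complemented address
    idx = sum(b << k for k, b in enumerate(rest))
    return sign * HALF[idx]

def obc_inner_product(x_int):
    """y = sum_n Q(b_n) 2^-n + 2^-(N-1) Q(0), with x_k = x_int[k] / 2^(N-1)."""
    words = [v & ((1 << N) - 1) for v in x_int]
    acc = Q0 * Fraction(1, 2 ** (N - 1))
    for n in range(N):
        bits = [(w >> (N - 1 - n)) & 1 for w in words]
        q = lookup(bits)
        if n == 0:
            q = -q  # sign-bit time: a_k0 = -(b_k0 - complement of b_k0)
        acc += q * Fraction(1, 2 ** n)
    return acc

x_int = [53, -128, 127, -17]  # assumed sample inputs
direct = sum(c * Fraction(v, 2 ** (N - 1)) for c, v in zip(C, x_int))
print(len(HALF), obc_inner_product(x_int) == direct)
```

The sign selection and the final negation at sign-bit time are the software counterpart of the adder/subtractor control logic mentioned above.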

IV. Proposed Decimator Design 

An equiripple half-band polyphase decimator is
designed and implemented using Matlab [9]. The proposed
decimator filter has order 66 (67 coefficients), a
transition width of 0.1 and 60 dB stop band attenuation;
its output is shown in Figure 2.

Nyquist decimators provide the same stop band
attenuation and transition width with a much lower
order. An Lth-band Nyquist filter with L = 2 is called a
half-band filter. The transfer function of a half-band
filter is thus given by

$$ H(z) = \alpha + z^{-1} E_1(z^2) \qquad (29) $$

with its impulse response satisfying

$$ h[2n] = \begin{cases} \alpha, & n = 0 \\ 0, & \text{otherwise} \end{cases} \qquad (30) $$




[Block diagram: MAC unit with a loop to process the coefficients]

Figure 3. MAC based Multiplier Implementation

In half-band filters about 50% of the coefficients of
h[n] are zero, which reduces the hardware requirement of
the proposed decimator significantly. The first
decimator design uses the multiplier technique, where the
67 coefficients are processed by a MAC unit as shown in
Figure 3. The second decimator design replaces the MAC
unit with a LUT unit, which is the proposed multiplier
less technique shown in Figure 4.

Figure 4. LUT based Multiplier Less Implementation

All 67 coefficients are divided into two parts by using
polyphase decomposition. The 2-branch polyphase
decomposition of an FIR decimator is shown in Figure 5
and can be expressed as:

$$ H(z) = E_0(z^2) + z^{-1} E_1(z^2) $$

Figure 5. Polyphase Decomposition
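The benefit of the 2-branch decomposition is that each branch filter runs at the lower output rate. The following minimal sketch, with hypothetical function names and small integer test data rather than the paper's 67 coefficients, checks that the polyphase form gives the same output as filtering at the input rate and then discarding every other sample.

```python
def fir(h, x):
    """Causal FIR filter: y[n] = sum_k h[k] * x[n-k], same length as x."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def decimate_direct(h, x):
    """Filter at the input rate, then keep every second output sample."""
    return fir(h, x)[::2]

def decimate_polyphase(h, x):
    """Down-sample first, then filter each branch at the low rate."""
    e0, e1 = h[0::2], h[1::2]      # E0 holds h[2m], E1 holds h[2m+1]
    x0 = x[0::2]                   # x[2n]
    x1 = ([0] + x[1::2])[:len(x0)] # x[2n-1], taking x[-1] = 0
    y0 = fir(e0, x0)
    y1 = fir(e1, x1)
    return [a + b for a, b in zip(y0, y1)]

# Assumed test data, for illustration only
h = [1, 2, 3, 4, 5, 6, 7]
x = list(range(1, 12))
print(decimate_polyphase(h, x) == decimate_direct(h, x))
```

In the direct form half of the computed samples are thrown away, whereas the polyphase form computes only the samples that are kept.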



Figure 2. Decimator Output

Figure 6. Computationally Efficient Structure

The proposed computationally efficient equivalent
structure is shown in Figure 6. In a DA realization of an
FIR filter structure, a sequence of input data words of
width W is fed through a parallel-to-serial shift register,
producing a serialized stream of bits. The serialized
data is then fed to a bit-wide shift register, which
serves as a delay line storing the bit-serial data
samples. The delay line is tapped (based on the input
word size W) to form a W-bit address that indexes into
a lookup table (LUT). The LUT stores all possible
sums of partial products over the filter coefficient
space. The LUT is followed by a shift-and-add unit
(scaling accumulator) that adds the values obtained
from the LUT sequentially. A table lookup is
performed sequentially for each bit, in order of
significance starting from the LSB. On each clock
cycle, the LUT result is added to the accumulated and
shifted result from the previous cycle. For the last bit
(MSB), the LUT result is subtracted, accounting
for the sign of the operand. This basic form of DA is
fully serial, operating on one bit at a time. If the input
data sequence is W bits wide, then an FIR structure takes
W clock cycles to compute the output. Symmetric and
antisymmetric FIR structures are an exception, requiring
W+1 cycles, because one additional clock cycle is
needed to process the carry bit of the pre-adders.

The inherently bit-serial nature of DA can limit
throughput. To improve throughput, the basic DA
algorithm can be modified to compute more than one
bit sum at a time. The number of simultaneously
computed bit sums is expressed as a power of two
called the DA radix. For example, a DA radix of 2
(2^1) indicates that one bit sum is computed at a time; a
DA radix of 4 (2^2) indicates that two bit sums are
computed at a time, and so on. To compute more than
one bit sum at a time, the LUT is replicated. For
example, to perform DA on 2 bits at a time (radix 4),
the odd bits are fed to one LUT and the even bits are
simultaneously fed to an identical LUT. The LUT
results corresponding to odd bits are left-shifted before
they are added to the LUT results corresponding to
even bits. This result is then fed into a scaling
accumulator that shifts its feedback value by 2 places.
Processing more than one bit at a time introduces a
degree of parallelism into the operation, improving
performance at the expense of area.
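A radix-4 bit-sum schedule can be sketched as below. The integer coefficients, the word length W = 8 and the function names are assumptions made for illustration; the weight `<< (2 * j)` plays the role of the 2-place shift of the scaling accumulator, and the same table is read twice per cycle to stand in for the replicated LUT.

```python
# Hypothetical integer coefficients and word length, for illustration only
C = [3, -5, 2, 7]
K, W = len(C), 8  # W is assumed even so that bits pair up exactly

# LUT[addr] = sum of C_k over the set bits of addr
LUT = [sum(C[k] for k in range(K) if (addr >> k) & 1) for addr in range(1 << K)]

def address(words, i):
    """Form the LUT address from bit i of every input word."""
    return sum(((w >> i) & 1) << k for k, w in enumerate(words))

def da_radix4(x):
    """Inner product sum_k C_k * x_k, processing two bit sums per cycle."""
    words = [v & ((1 << W) - 1) for v in x]      # W-bit two's complement
    acc, cycles = 0, 0
    for j in range(W // 2):
        even = LUT[address(words, 2 * j)]        # even-numbered bit
        odd = LUT[address(words, 2 * j + 1)]     # odd-numbered bit
        if 2 * j + 1 == W - 1:
            odd = -odd                           # MSB carries the sign
        acc += (even + (odd << 1)) << (2 * j)    # odd result is left-shifted
        cycles += 1
    return acc, cycles

y, cycles = da_radix4([100, -128, 127, -3])      # assumed sample inputs
print(y, cycles)
```

With W = 8 the loop runs for 4 cycles instead of the 8 needed by the fully serial form, at the cost of a second table.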

The size of the LUT grows exponentially with the
order of the filter. For a filter with N coefficients, the
LUT must have 2^N values. For higher order filters, the
LUT size must be reduced to reasonable levels. To
reduce the size in this work, the LUT is subdivided
into a number of smaller LUTs, called LUT partitions.
Each LUT partition operates on a different set of taps,
and the results obtained from the partitions are summed.
For example, for a 160-tap filter, the LUT size is
(2^160)*W bits, where W is the word size of the LUT
data. Dividing this into 16 LUT partitions, each taking
10 inputs (taps), reduces the total LUT size to
16*(2^10)*W bits, a significant reduction. In the
proposed design, the 67 coefficients are divided into two
sections of 34 and 33 coefficients respectively to
perform the polyphase decomposition. The 34 coefficients
of one section are then processed using a
(6, 6, 6, 6, 6, 4) DALUT partitioning to limit the size of
the LUTs. This multiplier less DALUT technique consists
of input registers, a 4-input LUT unit and a
shifter/accumulator unit.
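The effect of partitioning on table size can be worked through numerically. In the sketch below the coefficient values and the bit pattern are placeholders (the paper's filter coefficients are not listed), and the function names are hypothetical; only the partition sizes come from the text.

```python
def lut_words(partition):
    """Total number of LUT words for a list of per-partition tap counts."""
    return sum(2 ** taps for taps in partition)

def build_partitioned_luts(coeffs, partition):
    """One LUT per partition, each covering its own consecutive taps."""
    luts, start = [], 0
    for taps in partition:
        part = coeffs[start:start + taps]
        luts.append([sum(c for k, c in enumerate(part) if (addr >> k) & 1)
                     for addr in range(1 << taps)])
        start += taps
    return luts

def partitioned_lookup(luts, partition, bits):
    """Sum the partial results of all partitions for one bit slice."""
    total, start = 0, 0
    for lut, taps in zip(luts, partition):
        addr = sum(b << k for k, b in enumerate(bits[start:start + taps]))
        total += lut[addr]
        start += taps
    return total

PARTITION = [6, 6, 6, 6, 6, 4]              # partitioning quoted in the text
coeffs = list(range(1, 35))                 # placeholder values, 34 taps
bits = [(i * 7) % 3 == 0 for i in range(34)]
bits = [int(b) for b in bits]               # placeholder bit slice

luts = build_partitioned_luts(coeffs, PARTITION)
print(lut_words([34]), lut_words(PARTITION))
print(partitioned_lookup(luts, PARTITION, bits)
      == sum(c * b for c, b in zip(coeffs, bits)))
```

For 34 taps the single table would need 2^34 words, while the six partitions need 5*64 + 16 = 336 words plus the adders that combine the partial results.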

V. Implementation Results & Discussion 

The multiplier based and multiplier less decimators
are implemented and synthesized on the Spartan-3E based
3s500efg320-4 target device. The Modelsim simulated
output of the proposed decimator with 16-bit precision
is shown in Figure 7.

Figure 7. Simulated Decimator Output

Table 1 shows the area and speed comparison of both
techniques. The proposed DA based design shows a 24%
enhancement in speed while saving almost 50% of the
resources as compared to the MAC based design.

Table 1. Resource Utilization

Logic Utilization    Multiplier Approach       Multiplier Less Approach
# of Slices          1055 out of 4656 (22%)    472 out of 4656 (10%)
# of Flip Flops      1210 out of 9312 (12%)    515 out of 9312 (5%)
# of LUTs            857 out of 9312 (9%)      590 out of 9312 (6%)
# of Multipliers     1 out of 20 (5%)          0 out of 20 (0%)
Speed (MHz)          49.574                    61.215

Figure 8. Resource Comparison

The resource comparison of the multiplier and
multiplier less techniques is shown in Figure 8.
The multiplier approach consumes 9-22% of the
resources, as compared to 5-10% for the multiplier
less approach, owing to the efficient LUT partitioning
of the proposed DALUT algorithm.

VI. Conclusion

In this paper, an optimized half band polyphase
decomposition technique has been presented to
implement the decimator for wireless applications. The DA
algorithm has been used to further enhance the speed
and area utilization of the proposed design by taking
optimal advantage of the look up table structure of the
target FPGA. The proposed multiplier less approach has
shown an improvement of 24% in speed while saving almost
50% of the resources of the target device as compared to
the multiplier based approach. The proposed design
therefore provides a cost effective solution for the down
converter section of Software Defined Radios.